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Abstract 

Background: The avian influenza A H5N1 virus occasionally infects humans, with high mortality rates. Although all current 
human infections are from avian-to-human transmission, it has been shown that H5N1 can be evolved to transmit between 
mammals, and is therefore a pandemic threat. For H5N1 surveillance, it is of interest to identify the avian isolates most likely 
to infect humans. In this study, we develop a method to identify mutations significantly associated with avian to human 
transmission. 

Method: Using protein sequences for the surface glycoprotein hemagglutinin from avian and human H5N1 isolates in 
China, Egypt, and Indonesia from the years 1996-2011, we used Principle Component Analysis and a Maximum Likelihood 
Multinomial method to identify mutations associated with avian to human transmission. In each geographic region, 
transmission bias residues were identified using two signatures: a) significantly different amino-acid frequencies in human 
isolates compared to avian isolates from the same year, and b) significantly low probability of neutral evolution of the 
human isolates from the avian viral pool of the previous year. 

Results: In each geographic region, we find specific transmission bias mutations associated with human infections. These 
mutations are located in antigenic regions and receptor binding, glycosylation and polybasic cleavage sites of HA. We show 
that human isolates derive from a limited, subset of the avian pool characterized by geography specific mutations. In Egypt, 
two of three PCA clusters have very few human isolates but are highly enriched in mutations associated with a vaccine 
escape mutant H5N1 avian sub-clade that is known to be resistant to the Mexican H5N2 vaccine Furthermore, at these 
transmission bias associated residues, the mutations characteristic of these two clusters are distinct from those associated 
with the cluster enriched in human isolates, suggesting that vaccine resistant avian strains are unable to infect humans. Our 
results are relevant for surveillance and vaccination strategies for human H5N1 infections. 



Citation: Wagh K, Bhatia A, Greenbaum BD, Bhanot G (2014) Bird to Human Transmission Biases and Vaccine Escape Mutants in H5N1 Infections. PLoS ONE 9(7): 
el 00754. doi:1 0.1 371 /journal.pone.01 00754 

Editor: Elankumaran Subbiah, Virginia Polytechnic Institute and State University, United States of America 
Received March 5, 2014; Accepted May 30, 2014; Published July 2, 2014 

Copyright: © 2014 Wagh et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits 
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 

Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. The data on all the viral sequences used in the 
study was downloaded from the NCBI Influenza Virus Resource database (http://www.ncbi.nlm.nih.gov/genomes/FLU/Database/nph-select.cgi), on August 8, 
2012 (Egyptian isolates) and October 18, 2012 (Asian isolates). 

Funding: BG was the Eric and Wendy Schmidt Member of the Institute for Advanced Study when this work was begun and would like to thank them for their 
support. GB thanks the Kavli Institute for Theoretical Physics at the University of California at Santa Barbara and the Aspen Center for Physics for partial support 
through NSF grant numbers NSF PHY1 1-25915 and 1066293 respectively, during the time this work was done. The funders had no role in study design, data 
collection and analysis, decision to publish, or preparation of the manuscript. 

Competing Interests: The authors have declared that no competing interests exist. 

* Email: gyanbhanot@gmail.com 



Introduction 

The H5N1 Influenza A avian virus is an existing pandemic 
threat [1-4]. Although human H5N1 infections occur rarely, such 
infections are usually accompanied by severe respiratory compli- 
cations with high morbidity, and a mortality rate approaching 
60% [5-6]. Infections in humans occur almost exclusively from 
direct human contact with infected wild birds or poultry. 
Currently, the poor human-to-human transmission efficiency of 
circulating H5N1 strains [7] limits their pandemic potential. 
However, this poor transmission can be overcome by evolution of 
H5N1 in mammalian hosts [8]. Laboratory studies of experimen- 



tally evolved H5N1 strains show that current strains can transmit 
efficiendy between mammals (ferrets) with only 4—5 substitutions at 
specific residues in hemagglutinin (HA) and polymerase basic 2 
(PB2) proteins [9,10]. Given the high mortality rate of human 
infections from currently circulating avian strains, there is an 
urgent need to identify which avian H5N1 strains are most likely 
to infect humans. 

Vaccination of poultry has been undertaken to control H5N1 
infections in several countries such as China, Egypt, Indonesia and 
Viet Nam, where H5N1 is endemic [11]. Several inactivated 
reassortant H5N1, H5N2 and H5N3 vaccines as well as vector 
vaccines have been developed and used. The vaccinations have 
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shown limited efficacy and vaccine resistance has been observed 
[12]. Therefore it is imperative that vaccines be redesigned for 
efficacy against the prevalent strains. In Egypt, for example, the 
attenuated H5N1 strain vaccines predominantly used in domestic 
poultry have changed from using a 1996 strain (until 2008), to 
2006 strains (2009-2012), to a 2009 strain (2012 onwards). 
However, as of 2011, the vaccine predominantly used in 
commercial poultry farms is a Mexican 1994 H5N2 strain vaccine 
[12]. It is thus important to understand how the H5N1 virus is 
evolving under vaccination induced selection pressure. This issue 
has important implications for surveillance of the avian viral pool, 
intelligent vaccine design, and, most significantly, identifying avian 
strains likely to jump into human hosts. 

Since almost all human H5N 1 infections so far were transmitted 
from avian hosts, investigating and understanding any biases in 
such inter-host-species transmission is important. Any observed 
signature of biased transmission from birds to humans could 
represent enhanced/diminished efficiency due to specific func- 
tional mutations possessed by certain H5N1 strains which cause 
them to infect human hosts with greater efficiency. Selection in 
H5N1 viruses infecting humans has been studied previously [13- 
15] using differences in the rates of synonymous and non- 
synonymous mutations in human isolates as the characteristic 
signature for selection. However, because H5N1 is transmitted 
from birds to humans, with littie to no known transmission 
between human hosts, such an analysis cannot distinguish between 
selection pressures on H5N1 from avian infections versus human 
transmissions. In other words, analysis of H5N1 viruses solely from 
human subjects cannot identify which mutations in the avian 
H5N1 pool are important for transmission from birds to humans, 
compared to those which only have a selective advantage in birds. 
In this paper, we develop a simple strategy to identify and interpret 
such H5N1 transmission bias mutations which are important in 
infections from avian to human hosts. 

A mutation conferring higher infectivity in human infections but 
neutral in avian infections would be over-represented in human 
isolates, but not in avian isolates (Figure 1). Such mutations would 
have two characteristic signatures: a) a significant difference in 
ammo-acid frequencies in human isolates compared to avian 
isolates from the same year and b) a significantly low probability of 
neutral evolution of the human isolates from the avian viral pool of 
the previous year. We applied these criteria to protein sequences of 
H5N1 avian and human isolates from 1996-2011 collected in 
China, Egypt and Southeast Asia. We analyzed strains from each 
geographic region separately, and corrected for population sub- 
structure within each region using Principal Component Analysis. 
For this study, we focused on the protein hemagglutinin because of 
its key role in host-cell receptor binding and antigenicity [16,17]. 
For each geographic region, on an annual resolution, our methods 
identified residues that show biased transmission from birds to 
humans. We find that, in each geographic region, strains infecting 
humans originate from a subset of the avian viral pool 
characterized by specific mutations at identified residues ("trans- 
mission bias mutations/residues"). The identified transmission bias 
residues are in immunologically and functionally relevant regions 
of HA, such as the epitope regions, the receptor-binding site, the 
polybasic cleavage site and the trans-membrane site. In Egypt, at 
transmission bias residues conserved in human isolates, we find 
that human isolates are significantly different from avian isolates 
resistant to Mexican H5N2 vaccine. This raises the possibility that 
these vaccine-resistant avian strains may be unlikely to infect 
humans. This result is important because it suggests that 
appropriate vaccine pressure in birds may drive the H5N1 virus 
away from being able to infect humans. However, further 
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Figure 1. Methods to detect transmission Bias of H5N1 strains 
from birds to humans. A hypothetical scenario of a mutation under 
positive selection for human infections, but selectively neutral in avian 
infections, will result in a transmission bias of H5N1 strains from birds to 
humans. Such a residue would show: a) Significant increase in the 
frequency of residues in human isolates compared to their frequency in 
the avian viral pool; b) Low probability that the current human isolates 
are derived from neutral evolution of the avian viral pool of the 
previous year. We use these two tests to identify transmission-bias in 
human infections. 

doi:1 0.1 371/journal.pone.01 00754.g001 

experimental and environmental studies are needed to verify this 
possibility. 

Results 

Human H5N1 isolates derive from a subset of avian 
viruses with geography-specific epitope profiles 

We analyzed 1209 HA sequences of H5N1 isolates from avian 
(n= 1056) and human hosts (n= 153) from China, Indonesia and 
Egypt, collected from 1 996-2011. Principal Component Analysis 
(PCA) was used to study population structure (Methods). PCA 
plots for HA sequences from each geographic region are shown in 
Figure 2. In each geographic region, human isolates cluster with 
subsets of avian isolates, suggesting a transmission bias in H5N1 
infections from avian hosts to humans. To characterize the subsets 
of avian isolates most likely to infect humans, we identified ciusters 
of closely related human and avian isolates, using a distance cutoff 
in PCA space (Figure SI, Methods). The identified clusters consist 
of most of the human isolates in each region: 30 out of 36 in 
China, 70 out of 71 in Egypt, and 46 out of 50 in Indonesia (details 
on cluster membership in Tables SI A-C). 

We used two signatures to identify transmission bias residues in 
avian to human infections: a) The residues should have a 
significant difference in amino-acid frequency in human isolates 
compared to avian isolates from the same year and b) The residues 
in human isolates should have a significantly low probability to 
derive from neutrally evolved avian viral isolates from the avian 
viral pool of the previous year. To evolve the avian pool neutrally 
from one year to the next, we adapted the method of Pan and 
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Figure 2. Population substructure and transmission bias in H5N1 strains. The figure shows the first two principal components from PCA of 
HA amino acid sequences from avian and human isolates from China (A), Egypt (B) and Indonesia (C). Closely clustering human and avian isolates, in 
each region, were identified using an algorithm which clusters strains by using a distance cutoff in principal component space (Methods). 
doi:1 0.1 371 /journal.pone.01 00754.g002 



Deem [18]. The expected frequencies of amino acids at a given 
locus in a given year were obtained by neutrally evolving the 
observed amino acid frequencies at this locus in the previous year, 
using an influenza-specific amino acid transition probability 
matrix calculated by Dang et al. [19]. Using the avian isolate 
frequencies (either actual or expected under neutral evolution 
from the previous year) as a-priori human isolate amino-acid 
frequencies, we used a multinomial analysis to estimate the 
probabilities of the observed human isolate amino-acid frequen- 
cies. A Jackknife test was used to determine significance cutoffs. 
The most significant residues thus identified are listed in Table S2 
by year and geography. Amino acid frequencies for these residues 
are in Table 1 and Table S3. 

Several of these residues have one high frequency (>80%) 
amino acid in the human isolates in each region (Figure 3 and 
Table 1). Most of the human isolates in each geographical region 
cluster together (80-99%) (Figure 2) and these amino acids are 
virtually conserved in these closely clustering human isolates. 
These amino acids are also almost conserved (frequencies >89%) 
in closely clustering avian isolates, but have low to intermediate 
frequencies (18-38%) in other avian isolates (see Methods for the 



definition of "closely clustering" isolates). As indicated by the 
similarity of amino acid frequencies at all transmission bias 
residues between human and closely clustering isolates (Table 1), 
we found that at these residues the strains infecting humans are 
most likely to have evolved neutrally from the closely clustering 
avian isolates (Table S2, columns E, F). These results, taken 
together, suggest that for each geographic region, human 
infections are significantly more likely to arise from an identifiable 
subset of avian isolates, characterized by specific amino acids at 
the residues identified in Table 1 , rather than from the entire avian 
viral pool. 

Many of the loci associated with the transmission bias are 
located in or near functional regions of HA, such as the epitope 
regions (corresponding to epitopes B, D and E in H3 HA), the 
receptor-binding site, the polybasic cleavage site, and the trans- 
membrane region (Table S2). Given that some of the transmission 
bias residues are in epitopes, avian strains most likely to infect 
humans could have distinct epitope profiles from other avian 
strains. The mapping of these residues onto the protein structure 
of H5N1 HA [17] shows that most of these residues are in the 
head region of the HA protein structure (Figure S2). We also find 
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Figure 3. Average annual frequency of significant HA mutations responsible for geography specific transmission bias of H5N1. 

Average annual frequencies of the major amino-acid at significant residues (Table S1) for human isolates (red), avian isolates which cluster with 
human isolates (teal), and other avian isolates (black) from Egypt (A), China (B) and Indonesia (C). The grey bars represent two standard deviation 
variation in the observed annual frequency. 
doi:1 0.1 371 /journal.pone.01 00754.g003 



that all except one of the high frequency residues identified in 
Table 1 arose in the avian viral pool of the region in either the 
same or the previous year, as first reported in human infections, 
which further suggests their relevance to human infections. 

In Egypt, H5N1 strains resistant to Mexican H5N2 vaccine 
are less likely to infect humans 

Avian strains in Egypt have undergone diversification in 
response to vaccine induced selection pressure in poultry 
[20,21]. These antigenically drifted avian isolates are now 
classified as a variant group within the sub-clade 2.2.1 (group I 
in [21]). The mutations characteristic of these isolates are S-74, N- 
97, R-110, P-123, G-140, P-141, Y-144, K/E-162, H-165, E-184, 
and V-226 [19]. Our results identified all but two (G-140 and E- 
184) of the residues characterizing this vaccine resistant avian 
H5N1 group (Table 1 and Table S2). However, the mutations 
characterizing human isolates were distinct from the mutations characteristic of 
the escape mutant group. Specifically, residues 74, 97, 110, 123, 144, 
and 165 have virtually conserved amino-acids in closely clustering 
human and avian isolates, which are different from those 
characterizing the variant group of avian isolates. Since 70 out 
of 7 1 human isolates from our Egypt dataset cluster closely, this 



finding suggests that the variant avian strains of sub-clade 2.2.1 
from vaccinated birds are unable to infect humans. 

A serological study using reverse genetically designed viruses 
carrying the above mentioned variant group specific mutations 
showed that the mutations S-74, G-140, P-141, Y-144 and K-162 
are involved in escape from neutralization due to Mexican H5N2 
vaccine induced antibodies in chickens [12]. A comparative PC A 
analysis of the H5N1 isolates from Egypt carrying the high 
frequency transmission bias mutations versus the vaccine evasion 
mutations (Figure 4A-B) found that the closely-clustering human 
and avian isolates carry 0-1 out of the 5 vaccine evasion 
mutations, whereas the more divergent avian isolates carry 3-5 
mutations. We also find that the cluster of escape mutant avian 
isolates is distinct from the cluster containing human isolates, and 
carries 0-1 of the 8 high frequency transmission bias mutations (P- 
74, D-97, HI 10, S-123, S-141, F-144, N-165, and M-226). These 
results suggest that the mutations involved in vaccine evasion, at 
least in this instance, have resulted in inefficient transmission of 
avian H5N1 strains to humans. Thus, the potential of avian H5N1 
strains to infect humans seems to have been effectively neutralized 
by the use of the Mexican-derived H5N2 vaccine on poultry in 

Egypt. 
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Figure 4. Transmission bias mutations compared to vaccine evasion mutations in Egypt. PCA plots distinguish the frequencies of 
transmission bias loci (A) and vaccine evasion mutations (B). In (A) transmission bias mutations P-74, D-97, H-1 10, S-123, S-141, F-144, N-165 and M- 
226 (total = 8) have a high frequency in human isolates and closely clustering avian isolates but not in other avian isolates. In (B), vaccine-evasion 
mutations S-74, G-140, P-141, Y-144 and K-162 (total = 5) from Cattoli et al [12], which are responsible for resistance to the Mexican H5N2 vaccine 
strain commonly used in commercial poultry farms in Egypt are overrepresented in a different subcluster compared to the transmission bias 
mutations (A). 

doi:1 0.1 371 /journal.pone.01 00754.g004 



Residues associated with human H5N1 isolates after 
correcting for biased transmission 

We investigated the possibility of additional residues associated 
with human infections after correcting for the transmission biases 
described above. Such residues should display a) significant amino 
acid frequency differences between human isolates and the subset of 
closely clustering avian isolates, and b) significantly low probabilities of 
having evolved neutrally from the subset of closely clustering avian isolates 
of the previous year. The residues that have these properties are listed 
in Table S4 for each geographic region and year. The identified 
residues are in the epitope D region of the H3 hemagglutinin 
(residues 184—186 in isolates from Indonesia), and near the trans- 
membrane site (residue 513 in isolates from Egypt). The T513I 
mutation in Egypt arose in 2006, and for the years 2007 and 2008 
showed enrichment in human isolates (8 out of 1 9 and 3 out of 7 
human isolates respectively) as compared with the closest avian 
isolates (frequencies 0.07-0.15), but not for the year 2009 (Figure 
S3A). The mutation N-184 arose in 2005 in isolates circulating in 
Indonesia, and showed enrichment in the reported human isolates 
in 2007 (5 out of 5) over the avian isolates (frequency <20%) 
(Figure S3B). The mutations E-185 and E-186 are in close linkage 
with the N-184 mutation and show similar enrichment in human 
isolates as compared with avian isolates from Indonesia from the 
year 2007 (data not shown). 

Discussion 

Using HA sequence data, we identified a pronounced popula- 
tion substructure in the H5N1 strains in each geographical region 
studied, with human isolates clustering together with a subset of 
avian isolates (Figure 2). To identify the residues that could be 
important for avian to human transmission of H5N 1 , we designed 
a novel analysis based on comparison of human and avian isolates 
using two complimentary measures (Figure 1). This analysis of 
avian and human isolates in each geographic region identified 
specific mutations characterizing subsets of avian isolates with 
increased potential for infecting humans (Table 1, Table S2). 



These residues are in or near the epitope regions, the receptor 
binding site, and the polybasic cleavage site of the HA proteins. 
They have high frequencies for specific amino-acids at specific loci 
in human and closely clustering avian isolates and a significantly 
lower frequency in other avian isolates (Table 1, Figure 3). This 
suggests that not all avian strains can efficiently infect humans. 
Instead, only an identifiable subset, with specific amino acids at 
identified residues, can do so. The amino acids at these identified 
residues seem to be important for the H5N1 viruses to infect 
humans in specific geographic regions. 

A possible biological reasons for the enrichment of the identified 
mutations in human isolates relative to avian isolates might be that 
they are involved in efficient binding to receptors on human 
epithelial cells and/ or in evasion of neutralization by the human 
immune response. It is known from the study of Yamada et al. [1 7] 
that HA from human isolates has the ability to bind to cells with 
both the avian-type (oc2,3) and human-type (oc2,6) sialic acid 
receptors, whereas HA from avian isolates can bind only to the 
avian type sialic acid receptors. Watanabe et al. [22] have studied 
mutations responsible for receptor binding properties of human 
infecting H5N1 isolates circulating in Egypt and found that isolates 
with increased affinity for human-type sialic acid binding also 
retained binding to the avian-type sialic acid receptors. They 
showed that mutations at residue 192 and at residues 129 in 
combination with 151 (also identified in Table 1) enhanced the 
binding to the human-type sialic acid receptors, while still 
retaining binding to avian-type sialic acid receptors. 

However, we note that Watanabe et al. [22] also showed that an 
older reference avian strain (without the mutations at 129,151 and 
192), but with some of the amino acids identified in Table 1, does 
not bind efficiently to human-type sialic acid receptors. In 
addition, they showed that reverse-genetically designed isolates 
with specific mutations at 129, 151 and 192, in the background of 
the reference strain, increased the virulence of H5N 1 in mice as 
compared to the original reference avian strain. These results 
suggest that whereas the identified mutations in Table 1 may not 
be directly responsible for increased human-type sialic acid 
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receptor binding or increased virulence, they may be involved in 
complex (possibly background dependent) interactions that 
enhance infectivity in humans. Given that the mutations identified 
in Table 1 are almost conserved within human isolates and closely 
clustering avian isolates, but are at low frequencies in the other 
avian isolates, and that these mutations were present in the isolates 
in the study by Watanabe et al., these mutations may be a pre- 
requisite for higher human-type receptor binding and/ or higher 
virulence. 

Another possible reason for enhanced/ diminished transmission 
of some H5N1 strains from birds to humans could be vaccine- 
induced diversification of avian viruses. In Egypt, the identified 
residues conserved in the closely clustering human and avian 
isolates have also been shown to be involved in vaccine evasion 
[12]. Cattoli et al. studied the effect of Mexican H5N2 strain 
induced antibodies in chickens on a divergent clade of avian 
H5N1 isolates from commercial poultry farms in Egypt, where the 
Mexican H5N2 vaccines are used extensively. Using reverse 
genetics and serological studies, they found the mutations at 
residues 74, 140, 141, 144 and 162 to be important for the 
vaccine-resistance of the divergent clade of H5N1 isolates. Our 
results identified all the above residues, except residue 140. We 
find that the residues at 74, 141 and 144 are almost conserved in 
human isolates and are different from the ones involved in 
vaccine-resistance (Figures 3, 4). The mutual exclusivity of the 
presence of vaccine-evading mutations and the mutations enriched 
in human isolates (Figure 4A-B) in H5N1 strains suggests that 
during acquisition of vaccine-evasion, the divergent avian strains 
lose the ability to infect humans. China and Indonesia have also 
used vaccines for H5N1 control [1 1], and it would be interesting to 
investigate whether the mutations associated with human infec- 
tions also provide resistance to vaccines in poultry. 

The reduced efficiency of vaccine evading avian H5N 1 strains 
to infect humans could arise due to a) low viral loads in vaccinated 
poultry [23] leading to reduced transmission to humans; b) the 
escape mutant virus being a poor transmitter in general compared 
to the wild-type strains; or c) vaccine induced molecular changes 
making the mutant strains transmit inefficiently to humans. 
Immune escape of influenza viruses is intimately connected with 
receptor binding [24,25]. Thus, it is possible that the vaccine 
evasion in avian H5N1 strains could be accompanied by changes 
in receptor binding properties leading to a reduced infectivity in 
humans. In any event, our results show that vaccination with the 
Mexican H5N2 strain evolved the virus away from human 
infectivity. Cattoli et al. showed that avian reference strain with 
human-specific mutations can be neutralized by antibodies 
induced in chickens by vaccination with Mexican H5N2 [12]. 
Thus, although the intensive use of Mexican H5N2 vaccine has led 
to the development of vaccine-resistant avian H5N1 isolates, this 
vaccine could prove beneficial to control human infections in 
Egypt. To our knowledge, this is the first observation that selection 
pressure from some types of vaccination of poultry may be driving 
H5N1 away from being able to infect humans. Our analysis 
suggests that appropriate vaccination of poultry designed to be 
effective against specific epitopes enriched in human isolates may 
significandy mitigate the risk of human infections. However, 
several environmental effects, which cannot be adequately assessed 
from the literature, need to be taken into account. For instance, 
one needs to understand the fractions of human infections from 
backyard versus commercial poultry, and the degree to which 
poultry infected with vaccine escape mutant viruses came in 
contact with humans at all. Laboratory studies using viral epitopes 
identified in this paper would be the simplest way to prove or 
disprove our claims. 



After correcting for transmission bias of H5N1 isolates from the 
avian viral pool to humans, we found that certain residues have a 
high frequency in human isolates compared to the closely 
clustering avian isolates. These residues may increase the 
likelihood of human infectivity in the particular genetic back- 
ground of avian H5N1 strains that are more likely to infect 
humans (Table S4, Figure S3). In particular, the residues identified 
to have this property in Indonesian isolates are in the region 
corresponding to epitope D of H3 HA. The residue 5 1 3, which 
was identified in Egypt, lies close to the trans-membrane site of 
HA. Intriguingly, these residues do not have significant scores 
when human isolates are compared with all the avian isolates. The 
amino-acid frequencies at these sites in the human isolates are not 
significandy different from those in the entire avian pool, but differ 
only from the closely clustering avian isolates. This suggests the 
rather unusual possibility that loss of the wild type alleles at these 
mutations could enhance human infectivity in the genetic 
background of avian isolates that closely cluster with the human 
isolates. 

In summary, in each geographic region, only certain identifiable 
subgroups of avian H5N 1 isolates seem able to infect humans, and 
selection pressure from vaccination has created escape mutants 
that are unable to infect humans. Experimental investigation of 
these results would provide additional insights into the biological 
mechanisms underlying enhanced human infectivity of certain 
H5N1 strains as well as on how vaccination pressure affects the 
ability of H5N1 avian viruses to infect humans. 

Materials and Methods 

Sequence data 

Aligned amino acid and nucleotide sequences for hemagglutinin 
of H5N1 isolates were downloaded from the NCBI Influenza 
Virus Resource database (http:/ /www.ncbi.nlm.nih.gov/ 
genomes/FLU/Database/nph-select.cgi), on August 8, 2012 
(Egyptian isolates) and October 18, 2012 (Asian isolates). 
Alignment was performed using the program MUSCLE using 
default parameters. Identical strains were removed using both the 
web resource's option and additional programming (to account for 
identity up to missing residues) with unique human isolates 
preferably retained from a set of identical isolates. Host, region, 
and year information for all isolates wwas also downloaded from 
the above website. The resulting dataset comprised 1209 (153 
human, 1056 avian) isolates in Egypt, China, and Indonesia from 
years 1996-2011. Human isolates for all geographical regions 
combined were from years 2005-2010. 

Principal Component Analysis (PCA) 

We performed PCA on hemagglutinin amino acid and 
nucleotide sequence data for isolates from both avian and human 
hosts from each region to understand population structure. Both 
amino acid and nucleotide sequences in the dataset had sites with 
more than two variants. To encode these amino acids or 
nucleotides into numerical values, we used the following prescrip- 
tion. Amino acids at each residue were assigned values 0,1, 2,..., 19, 
with the most common variant assigned to 0, the next frequent 1 , 
and so on. In all of the isolates in each geographic region, we 
excluded residues with a missing amino acid, which would indicate 
a deletion or missing sequence. The numerical data for each 
residue was normalized by subtracting the mean. However, we did 
not divide the result by the standard deviation to ensure that the 
more variant sites carry higher weight in the PCA analysis. The 
PCA analysis was done using the module for Singular Value 
Decomposition in SciPy [26]. 
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Identifying the avian strains that closely cluster with 
human strains using a distance cutoff in PCA space 

PC A on H5N1 isolates from avian and human hosts in each 
geographic region revealed that isolates from each region 
exhibited signs of population substructure (Results). To understand 
this population sub-structure and its relevance to transmission bias, 
we constructed sub-clusters in PCA space by first clustering human 
isolates that were close to one other, and then clustering avian 
isolates that were close to these human isolates. More specifically, 
we first retained only those PCA components that accounted for > 
4% variance. We then constructed clusters of human isolates that 
fell within a distance corresponding to 4% variation of the total 
variation in each local region. We found that by using this distance 
cutoff, almost all (>80%) of the human isolates in each region 
clustered together. 

We used the following algorithm for clustering human isolates. 
Initially all human isolates were placed in the un-clustered list. 
Because each human isolate belongs to a cluster (albeit of size one), 
we chose a random isolate to seed the first cluster and removed it 
from the un-clustered list of isolates. In the next step, all isolates 
within the distance cutoff from this initial isolate were included in 
the cluster, and removed from the list of un-clustered isolates. If 
the cluster size was greater than one, then new un-clustered 
isolates were added to this cluster if they were closer than the 
distance cutoff to at least one of the cluster isolates. This step was 
iterated until there were no more isolates in the un-clustered list 
that were within the distance cutoff to any of the cluster isolates. 
To construct the next cluster, an isolate was randomly chosen from 
the un-clustered list of isolates, and the same algorithm was 
repeated. The construction of clusters ended when the continu- 
ously updated list of un-clustered isolates contained no remaining 
isolates. For each geographic region we found that most (>80%) 
human isolates formed a single cluster using the distance cutoff of 
4% of total variance. 

We then identified all the avian isolates that fall within a 
distance corresponding to 2% variation to all the human isolates in 
the identified cluster (Figure SI shows a schematic representation 
of this method). This subset of the avian isolates was then used to 
identify the set of avian isolates closest to the human isolates. PCA 
plots showing human and closely clustering avian isolates are 
shown in Figure 2. 

Detection of residues in the human isolates with 
significant amino acid frequency differences from the 
avian isolates 

For isolates from each region and year, we computed the 
significance of differences in amino acid frequencies at each 
residue between the human and avian isolates. We treat amino 
acids at a residue in human isolates from a given year and region 
as samples drawn from the distribution of amino acids present at 
the same residue in the avian isolates of the same year. We then 
used the multinomial formula for sampling to evaluate the 
likelihood of sampling the human amino acid configuration from 
the amino acid distribution from the avian isolates. At a given 
residue in isolates from a given year and region, let [n\,n 2 ,n^,,..\ be 
the observed counts for amino-acids \aa\,aa 2 ,aa^,..\ in the human 
isolates, and let \pi,P2,P3,-~] be the corresponding amino-acid 
frequencies in the avian isolates. The likelihood of seeing these 
counts in human isolates, given that they are sampled randomly 
from the avian isolates, is given by 

L([n u n 2 ,n 3 ,...]; \p u p 2 ,p 3 ,...]) = N\(p" l l p" 2 2 p" 3 :i ...)/(«i \n 2 \n 3 \...) 



where N = n\ + n 2 + «3 + ... is the total number of human isolates 
from a given year and region. 

An empirical p-value for this likelihood was computed by 
drawing 10 8 random sample sets of equal size to the human 
isolates from the distribution of amino-acid residues in avian 
isolates, and counting the fraction of such realizations with a lower 
likelihood than observed. To correct for population structure in 
H5N1 isolates from each geographic region, we performed this 
analysis only on the subset of avian isolates that clustered closest to 
human isolates, as described in the previous section. 

Detection of residues in human isolates with low 
probability of evolving neutrally from the avian H5N1 
isolates 

We adapted the method introduced by Pan and Deem [18] to 
compute the probabilities of evolving the observed amino-acid 
configurations at each site of human isolates from the avian 
isolates of the previous year. If the amino-acid frequencies in the 
avian isolates for the previous year (sayjy-7) were observed to be 
\fl>P2j>3,—], then the theoretically evolved frequencies 
\Pe,l>Pe,2,Pe,3>---] can be computed using the protein evolution 
model Q as 

\Pe, 1 ,Pe,2 ,Pe,3,---] = \Pl,P2,P3,---]»e Q ' 

where [..] is a row vector, Q is a 20*20 matrix, and t is measured in 
units of mutation rate, which we assume to be the substitution rate 
of 4.77*10~Vsite/year [27]. For Q, we used an Influenza specific 
protein evolution model FLU, as calculated by Dang et al [19]. 
Using these evolved frequencies, we compute the likelihood of 
observing the amino-acid configuration of human isolates of year y 
as before: 

L([tli ,«2,«3,-] ; \Pe,\ ,Pe,2 ,Pe,3 ,-}) = ^(/OK 2^3 -)/(»l !»2 !»3 ! -) 

To compute the significance (p-value) of this likelihood value, 
we randomly generated 10 8 samples from the evolved distribution 
\Pe,l,Pe,2,Pe,3,---] an d calculated the likelihood using the above 
formula for each of these samples. The empirical p-value of the 
observed likelihood value is the fraction of these 1 0 samples with 
lower likelihoods than the one observed. 

Sensitivity to ascertainment bias 

The database contained far fewer human isolates than avian 
isolates, with some years having very few (sometimes only 10) 
human isolates. In such cases, a few outiier samples could bias the 
results. To correct for this, we studied the sensitivity of our results 
when only a subset of the full dataset is used. We randomly chose 
1,000 subsets containing 75% of human and 1,000 subsets of 75% 
of avian isolates in each year. Our analysis was repeated on the 
1,000*1,000= 10 6 combinations of these subsets of human and 
avian isolates. We then calculated the mean and standard 
deviations for the likelihoods of amino acid frequency difference 
and likelihood of neutral evolution using the methods described 
above for all the combinations. In our final results we only retained 
those residues which either had a mean likelihood of amino acid 
frequency difference <10 -5 and neutral evolution likelihood < 
10 3 or vice versa. 
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Sensitivity to mutation rate variation 

Because the computation of probabilities of neutral evolution of 
human isolates from the avian viral isolates uses mutation rate as 
an input parameter, we studied the sensitivity to local variation in 
mutation rates. We first used the program PhyML [28] to generate 
the maximum likelihood values for the 4 mutation rate classes of a 
discrete T4 model of variable mutation rates. We used both 
human and avian isolates from Egypt to be analyzed using PhyML 
using the model FLU and other default parameters. We obtained 
the maximum likelihood values for 4 classes of the discrete T4 
model to be {0.0288, 0.2353, 0.8012, 2.9346}, which we 
multiplied with the mean rate of 4.77*10~'V site/year [28] to get 
the 4 classes of mutation rates. We then calculated the likelihood of 
neutral evolution for all the significant sites using each of these 
rates, and found that all the significant residues had mean 
likelihood <10" 5 . 

Supporting Information 

Figure SI Schematic of clustering algorithm in PCA 
space. The figure shows the first two principal components PCA 
of HA amino acid sequences from avian and human isolates from 
Egypt. A) A sphere of proximity with radius corresponding 1 % of 
the total variance was constructed around each human isolate, and 
all human isolates connected by overlapping circles were clustered 
together. Avian isolates located in the same cluster as human 
isolates were added to the cluster. B) Results from the 
implementation of the algorithm using the first 4 principal 
components and spheres of proximity with radius corresponding 
to 2% of the total variance. 
(PNG) 

Figure S2 Location of significant residues on H5 
Hemagglutinin. Loci on the H5 hemagglutinin protein which 
show transmission bias. Structure data was obtained from PDB file 
2IBX [15], and was analyzed using the program PyMol. Red, blue 
and purple spheres represent loci under transmission bias in 
Indonesia, Egypt and China respectively. 
(PNG) 

Figure S3 Loci with significantly different frequencies 
after correcting for biased transmission of H5N1. A) The 

1-513 mutation in HA from H5N1 strains circulating in Egypt 
shows enrichment in human isolates in years 2007 and 2008 
compared to avian isolates. B) The N-183 mutation in HA from 
H5N1 strains circulating in Indonesia shows enrichment in human 
isolates compared to avian isolates. The numbers in parentheses 
are the number of human isolates in each region and year. 
(PNG) 

Table SI Table SI A, Closely clustering avian and human 
isolates (methods) identified using HA sequences from H5N1 
strains in Egypt; Table SIB, Closely clustering avian and human 
isolates (methods) identified using HA sequences from H5N1 
strains in China; Table SIC, Closely clustering avian and human 
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